Attribution: A lot of the material for this lecture came from the following resources

Motivation

image source

Many classifiers have been applied to recognize hand-written digits.

image source

In this lecture, we will explore one of the most successful classification methods (i.e., lowest error rate) – deep learning (a.k.a. neural networks).

Fundamental Concepts in Deep Learning

Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level, hierarchical representations in data using:

  1. deep graphs with multiple processing layers;
  2. multiple linear and non-linear transformations.

Basic assumptions:

  1. We can learn from the data everything we need to solve the task;
  2. Employing a large number of very simple computational units, we can solve even complex problems.

(Artificial) Neural Networks (ANN)

The perceptron algorithm – the fundamental computational unit at the heart of every deep learning model – was invented in 1957 by Frank Rosenblatt.

image source

For example, you can think of logistic regression as a special case of neural networks.

image source
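As a concrete illustration, logistic regression is exactly a single neuron: a linear combination of the inputs passed through the sigmoid activation. The function names below are illustrative, not from any package.

```r
# Logistic regression viewed as a one-neuron network:
# a weighted sum of inputs plus a bias, passed through the sigmoid.
sigmoid = function(z) 1 / (1 + exp(-z))

neuron = function(x, w, b) sigmoid(sum(w * x) + b)

# A single input with weight 2 and bias -1:
neuron(x = c(0.5), w = c(2), b = -1)  # sigmoid(0) = 0.5
```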

Multilayer Perceptron (MLP)

The single-layer perceptron has difficulty distinguishing data that are not linearly separable. One solution is to modify the model by introducing hidden layers between the input and output layers; the resulting model is called the multilayer perceptron. In fact, an MLP with even one hidden layer can approximate any continuous function arbitrarily well. Why and how? [Reference](http://neuralnetworksanddeeplearning.com/chap4.html#universality_with_one_input_and_one_output)

image source

Mathematical Details

A neural network is a two-stage regression or classification model. For regression, typically \(K = 1\) and there is only one output unit \(Y_1\) at the top. However, these networks can handle multiple quantitative responses in a seamless fashion, so we will deal with the general case.

For K-class classification, there are \(K\) units at the top, with the kth unit modeling the probability of class k. There are \(K\) target measurements \(Y_k\), \(k = 1,\cdots,K\), each being coded as a 0 − 1 variable for the kth class. Derived features \(Z_m\) are created from linear combinations of the inputs, and then the target \(Y_k\) is modeled as a function of linear combinations of the \(Z_m\),

\[ \begin{aligned} Z_m &= \sigma(\alpha_{0m}+\alpha_m^TX), \quad m = 1,\cdots,M,\\ T_k &= \beta_{0k} +\beta_k^TZ, \quad k = 1,\cdots, K,\\ f_k(X) &= g_k(T), \quad k = 1,\cdots, K, \end{aligned} \]

where \(Z = (Z_1, Z_2,...,Z_M)\), and \(T = (T_1, T_2,...,T_K)\).

The output function \(g_k(T)\) allows a final transformation of the vector of outputs T. For regression we typically choose the identity function \(g_k(T) = T_k\). Early work in K-class classification also used the identity function, but this was later abandoned in favor of the softmax function $$ g_k(T) = \frac{e^{T_k}}{\sum_{l=1}^K e^{T_l}}. $$
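The model above can be sketched directly in base R as a forward pass: derived features \(Z\), linear scores \(T\), and softmax probabilities \(f_k(X)\). The dimensions and random parameter values below are illustrative.

```r
# Forward pass: Z = sigma(alpha_0 + alpha^T X), T = beta_0 + beta^T Z,
# f_k(X) = softmax(T)_k. All dimensions and values here are illustrative.
sigmoid = function(z) 1 / (1 + exp(-z))
softmax = function(t) exp(t) / sum(exp(t))

forward = function(x, alpha0, alpha, beta0, beta) {
  z   = sigmoid(alpha0 + t(alpha) %*% x)  # M derived features Z_m
  t_k = beta0 + t(beta) %*% z             # K linear scores T_k
  softmax(t_k)                            # class probabilities f_k(X)
}

set.seed(1)
p = 4; M = 3; K = 2                       # inputs, hidden units, classes
probs = forward(rnorm(p),
                alpha0 = rnorm(M), alpha = matrix(rnorm(p * M), p, M),
                beta0  = rnorm(K), beta  = matrix(rnorm(M * K), M, K))
sum(probs)  # the K probabilities sum to 1
```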

Fitting Neural Networks

Loss function

We first need to define our measure of fit (a.k.a. loss function or cost function). Many different loss functions can be designed for the same problem. An example of a loss function for a regression problem is the sum-of-squared errors,

\[ C(\theta) = \sum_{i=1}^n (y_i - f(x_i,\theta))^2. \] Typically we don’t want the global minimizer of \(C(θ)\), as this is likely to be an overfit solution. Instead some regularization is needed: this is achieved directly through a penalty term, or indirectly by early stopping.

Backpropagation

Let us recall gradient descent.

image source

Let’s think about what happens when we move the ball a small amount \(\Delta v_1\) in the \(v_1\) direction, and a small amount \(\Delta v_2\) in the \(v_2\) direction. Calculus tells us that the cost function changes as follows:

\[ \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 \equiv \nabla C \cdot \Delta v, \end{eqnarray} \] where \(\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T\), \(\Delta v \equiv (\Delta v_1, \Delta v_2)^T\).

Then we want to move the ball to a new position \(v'\), \[ \begin{eqnarray} v \rightarrow v' = v -\eta \nabla C, \end{eqnarray} \] where \(\eta\) is a small, positive parameter (known as the learning rate).
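As a minimal illustration, the update rule \(v \rightarrow v - \eta \nabla C\) can be run on a simple quadratic cost. The cost function, starting point, and learning rate below are all illustrative.

```r
# Gradient descent on the cost C(v) = v1^2 + 2*v2^2:
# repeatedly step in the direction -grad C, scaled by the learning rate eta.
grad_C = function(v) c(2 * v[1], 4 * v[2])

eta = 0.1
v = c(3, -2)
for (i in 1:100) v = v - eta * grad_C(v)
v  # close to the minimizer (0, 0)
```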
Another way to read this equation is that the change \(\Delta C\) in the cost is driven by the change \(\Delta v\) in position \(v\). Now, back to our neural network,

image source

image source

image source

\[ \Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \]

This is exactly the chain rule for computing the derivative of the composition of two or more functions.

This procedure is called backpropagation, which is essentially how the gradient descent algorithm is applied to neural networks. The backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.
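A minimal sketch of backpropagation on a tiny 1–1–1 network with a squared-error cost: the forward pass caches the activations, and the backward pass applies the chain rule layer by layer. The network shape and values are illustrative, and the analytic gradient is checked against a numerical derivative.

```r
# Backpropagation for a 1-1-1 sigmoid network with cost C = (y - a2)^2.
sigmoid = function(z) 1 / (1 + exp(-z))

backprop = function(x, y, w1, w2) {
  a1 = sigmoid(w1 * x)               # hidden activation
  a2 = sigmoid(w2 * a1)              # output activation
  # chain rule: dC/da2 -> dC/dz2 -> dC/dw2 and dC/dw1
  d2 = -2 * (y - a2) * a2 * (1 - a2)
  list(dw2 = d2 * a1,
       dw1 = d2 * w2 * a1 * (1 - a1) * x)
}

g = backprop(x = 1, y = 1, w1 = 0.5, w2 = 0.5)

# sanity check: compare dC/dw1 to a central-difference numerical derivative
eps = 1e-6
num = ((1 - sigmoid(0.5 * sigmoid((0.5 + eps) * 1)))^2 -
       (1 - sigmoid(0.5 * sigmoid((0.5 - eps) * 1)))^2) / (2 * eps)
abs(g$dw1 - num)  # small: analytic and numerical gradients agree
```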

For ANNs we have many hyper-parameters to tune and this is often pointed out as the major downside of these models:

  • Number of hidden layers
  • Number of hidden units
  • Number of training iterations
  • Learning rate
  • Regularization

Overfitting

Often neural networks have too many weights and will overfit the data at the global minimum of \(C(\theta)\). In early developments of neural networks, either by design or by accident, an early stopping rule was used to avoid overfitting. Here we train the model only for a while, and stop well before we approach the global minimum. Since the weights start at a highly regularized (linear) solution, this has the effect of shrinking the final model toward a linear model. A validation dataset is useful for determining when to stop, since we expect the validation error to start increasing.

A more explicit method for regularization is weight decay, which is analogous to ridge regression used for linear models. We add a penalty to the error function

\[ C(\theta) + \lambda ||\theta||^2, \] where \(\lambda\geq 0\) is a tuning parameter.
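As a sketch, here is the penalized cost for a simple linear \(f(x,\theta)\); the model form and data below are illustrative.

```r
# Weight decay: penalized cost C(theta) + lambda * ||theta||^2,
# with C the sum-of-squared errors from above.
penalized_cost = function(theta, x, y, lambda) {
  f = theta[1] + theta[2] * x                # a simple linear f(x, theta)
  sum((y - f)^2) + lambda * sum(theta^2)
}

x = c(1, 2, 3); y = c(2, 4, 6)
penalized_cost(c(0, 2), x, y, lambda = 0)    # perfect fit: cost 0
penalized_cost(c(0, 2), x, y, lambda = 0.1)  # same fit, plus penalty 0.1 * 4
```

Larger \(\lambda\) shrinks the weights toward zero, trading a worse fit for a smoother, more regularized model.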

Example

Data

The data we will use is from [Kaggle](https://www.kaggle.com/c/digit-recognizer/data) and is available in a .csv file.

A description of the data from Kaggle:

“The dataset was constructed from a number of scanned document dataset available from the National Institute of Standards and Technology (NIST). This is where the name for the dataset comes from, as the Modified NIST or MNIST dataset.”

“Images of digits were taken from a variety of scanned documents, normalized in size and centered. This makes it an excellent dataset for evaluating models, allowing the developer to focus on the machine learning with very little data cleaning or preparation required.”

“Each image is a 28 by 28 pixel square (784 pixels total). A standard split of the dataset is used to evaluate and compare models, where 60,000 images are used to train a model and a separate set of 10,000 images are used to test it.”

“It is a digit recognition task. As such there are 10 digits (0 to 9) or 10 classes to predict. Results are reported using prediction error, which is nothing more than the inverted classification accuracy.”

library(h2o)

----------------------------------------------------------------------

Your next step is to start H2O:
    > h2o.init()

For H2O package documentation, ask for help:
    > ??h2o

After starting H2O, you can use the Web UI at http://localhost:54321
For more information visit http://docs.h2o.ai

----------------------------------------------------------------------

Attaching package: 'h2o'
The following objects are masked from 'package:stats':

    cor, sd, var
The following objects are masked from 'package:base':

    &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
    colnames<-, ifelse, is.character, is.factor, is.numeric, log,
    log10, log1p, log2, round, signif, trunc
h2o.init()

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /var/folders/5j/9xt6f40x6_7dvjkc46s6t8zm0000gn/T//RtmpMCaBeP/h2o_guoqingwang_started_from_r.out
    /var/folders/5j/9xt6f40x6_7dvjkc46s6t8zm0000gn/T//RtmpMCaBeP/h2o_guoqingwang_started_from_r.err


Starting H2O JVM and connecting: .. Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         2 seconds 835 milliseconds 
    H2O cluster timezone:       America/New_York 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.20.0.8 
    H2O cluster version age:    2 months and 2 days  
    H2O cluster name:           H2O_started_from_R_guoqingwang_npx197 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.56 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.5.1 (2018-07-02) 
dat = read.csv("../data/train.csv")
dat$label = as.factor(dat$label)
train = dat[1:5000,]
test = dat[5001:10000,]
validation = dat[10001:15000, ]

train.h2o = as.h2o(train)

  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o = as.h2o(test)

  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
str(train)
'data.frame':   5000 obs. of  785 variables:
 $ label   : Factor w/ 10 levels "0","1","2","3",..: 2 1 2 5 1 1 8 4 6 4 ...
 $ pixel0  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel1  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel2  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel3  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel4  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel5  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel6  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel7  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel8  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel9  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel10 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel11 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel12 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel13 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel14 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel15 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel16 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel17 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel18 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel19 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel20 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel21 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel22 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel23 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel24 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel25 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel26 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel27 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel28 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel29 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel30 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel31 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel32 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel33 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel34 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel35 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel36 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel37 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel38 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel39 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel40 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel41 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel42 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel43 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel44 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel45 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel46 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel47 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel48 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel49 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel50 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel51 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel52 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel53 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel54 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel55 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel56 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel57 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel58 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel59 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel60 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel61 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel62 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel63 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel64 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel65 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel66 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel67 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel68 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel69 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel70 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel71 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel72 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel73 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel74 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel75 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel76 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel77 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel78 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel79 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel80 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel81 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel82 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel83 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel84 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel85 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel86 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel87 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel88 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel89 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel90 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel91 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel92 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel93 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel94 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel95 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel96 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pixel97 : int  0 0 0 0 0 0 0 0 0 0 ...
  [list output truncated]
model = h2o.deeplearning(x = 2:785, y = 1,
                         training_frame = train.h2o,
                         hidden = c(5),
                         seed = 0)
Warning in .h2o.startModelJob(algo, params, h2oRestApiVersion): Dropping bad and constant columns: [pixel729, pixel448, pixel449, pixel724, pixel725, pixel726, pixel727, pixel728, pixel560, pixel52, pixel51, pixel54, pixel53, pixel168, pixel56, pixel169, pixel55, pixel58, pixel57, pixel59, pixel280, pixel559, pixel671, pixel672, pixel673, pixel674, pixel392, pixel393, pixel700, pixel701, pixel308, pixel141, pixel142, pixel780, pixel781, pixel782, pixel420, pixel783, pixel421, pixel140, pixel139, pixel777, pixel778, pixel779, pixel8, pixel9, pixel6, pixel7, pixel4, pixel5, pixel60, pixel252, pixel2, pixel3, pixel0, pixel1, pixel532, pixel644, pixel645, pixel364, pixel760, pixel10, pixel365, pixel12, pixel11, pixel643, pixel14, pixel13, pixel16, pixel15, pixel18, pixel17, pixel19, pixel754, pixel755, pixel756, pixel757, pixel758, pixel759, pixel83, pixel196, pixel82, pixel197, pixel85, pixel110, pixel84, pixel111, pixel87, pixel112, pixel86, pixel113, pixel476, pixel114, pixel477, pixel752, pixel88, pixel753, pixel504, pixel30, pixel32, pixel31, pixel223, pixel587, pixel33, pixel336, pixel699, pixel732, pixel615, pixel21, pixel20, pixel23, pixel697, pixel730, pixel22, pixel335, pixel698, pixel731, pixel25, pixel24, pixel27, pixel26, pixel29, pixel28].

  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |====================                                             |  30%
  |                                                                       
  |=================================================================| 100%
h2o.confusionMatrix(model, test.h2o)
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
         0   1   2   3   4   5  6   7    8   9  Error            Rate
0      436   0   1   2   1  25  2   4   23   3 0.1227 =      61 / 497
1        0 486   2   7   1   1  2  10   23   5 0.0950 =      51 / 537
2        7   5 176  37 128   3  3  18   90  33 0.6480 =     324 / 500
3       26  10   8 397   3  25  1  14   13  32 0.2495 =     132 / 529
4        0   4  50  10 285   1  0   3  114  23 0.4184 =     205 / 490
5       39   2   6  15   1 232  9  13   76  44 0.4691 =     205 / 437
6        8   5   2   5   9   3  7  13  426   9 0.9856 =     480 / 487
7        4  23  17  11   8  31  0 393   18  28 0.2627 =     140 / 533
8        7  31  16  19   9  14  0  10  353  14 0.2537 =     120 / 473
9        1   9  32  23   7  25  0  11   44 365 0.2940 =     152 / 517
Totals 528 575 310 526 452 360 24 489 1180 556 0.3740 = 1,870 / 5,000

Convolutional Neural Network

As we saw in the previous sections, regular neural networks receive an input (a single vector), and transform it through a series of hidden layers. However, one of the main drawbacks of regular neural networks is that:

  • Regular neural nets don’t scale well to full images. Let’s say we have to classify images of size 32x32x3 (32 wide, 32 high, 3 color channels); a single fully-connected neuron in the first hidden layer of a regular neural network would have \(32\times32\times3 = 3072\) weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have \(200\times200\times3 = 120,000\) weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
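The weight counts above are simple arithmetic:

```r
# Weights per fully-connected neuron = width * height * channels
32 * 32 * 3    # 3072 weights for a 32x32x3 image
200 * 200 * 3  # 120000 weights for a 200x200x3 image
```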

Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) Here is a visualization:

image source

We can see a CNN as a sequence of layers, where each layer transforms one volume of activations to another through a differentiable function. There are four fundamental types of layers:

  • Convolutional Layer – compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and a small region they are connected to in the input volume. This may result in volume such as [32x32x12] if we decided to use 12 filters.
  • RELU layer – apply an elementwise activation function, such as the \(max(0,x)\) thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]).
  • Pooling Layer – perform a downsampling operation along the spatial dimensions (width, height), resulting in volume such as [16x16x12].
  • Fully-Connected Layer – compute the class scores, resulting in volume of size [1x1x10], where each of the 10 numbers correspond to a class score, for example. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume.

Convolutional Layer

In computer vision, a very typical approach for processing an image is to convolve it with a filter (or kernel) in order to extract only salient features from the image.

image source

More specifically,

image source

image source

We need more than one output feature map, because each filter extracts different features from the input image. The number of feature maps in each convolutional layer is another hyper-parameter to tune.
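The convolution step above can be sketched in base R: slide a small filter over the image and take a dot product at each position. (As is common in CNNs, this is actually cross-correlation, i.e. the filter is not flipped.) The image and filter values below are illustrative.

```r
# Valid 2D convolution (cross-correlation) of an image with a k x k filter.
conv2d = function(img, filt) {
  k = nrow(filt)
  out = matrix(0, nrow(img) - k + 1, ncol(img) - k + 1)
  for (i in 1:nrow(out))
    for (j in 1:ncol(out))
      out[i, j] = sum(img[i:(i + k - 1), j:(j + k - 1)] * filt)
  out
}

img  = matrix(1:16, 4, 4)               # a 4x4 "image"
edge = matrix(c(1, -1, 1, -1), 2, 2)    # a simple 2x2 vertical-difference filter
conv2d(img, edge)                       # 3x3 feature map
```

Note the output shrinks from 4x4 to 3x3: a k x k filter over an n x n image yields an (n-k+1) x (n-k+1) feature map.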

Pooling Layer

It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting.

The pooling layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation in this case takes a max over 4 numbers (a little 2x2 region in some depth slice). The depth dimension remains unchanged.

image source
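The 2x2, stride-2 max pooling described above can be sketched in base R: each output cell is the max of a non-overlapping 2x2 block, halving the width and height. The input values below are illustrative.

```r
# 2x2 max pooling with stride 2 on a single depth slice.
max_pool = function(x) {
  out = matrix(0, nrow(x) / 2, ncol(x) / 2)
  for (i in 1:nrow(out))
    for (j in 1:ncol(out))
      out[i, j] = max(x[(2*i - 1):(2*i), (2*j - 1):(2*j)])
  out
}

a = matrix(c( 1,  3,  2,  4,
              5,  7,  6,  8,
              9, 11, 10, 12,
             13, 15, 14, 16), 4, 4, byrow = TRUE)
max_pool(a)  # 2x2 matrix of block maxima: 16 activations reduced to 4
```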

Fully-Connected Layer

Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. See the Neural Network section of the notes for more information.

Layer Patterns

The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern:

INPUT -> [[CONV -> RELU]*N -> POOL]*M -> [FC -> RELU]*K -> FC.

Here \(N \geq 0\) (and usually \(N \leq 3\)), \(M \geq 0\), \(K \geq 0\) (and usually \(K < 3\)).

Example

We are using the same MNIST dataset that we introduced before, but applying the CNN.

Here we introduce another commonly used R package for CNNs, mxnet. The installation manual can be found [here](https://en.wikipedia.org/wiki/MNIST_database#Performance).

detach("package:h2o", unload=TRUE)
[1] "A shutdown has been triggered. "
library(mxnet)
library(caret)
Loading required package: lattice
Loading required package: ggplot2
data = read.csv("../data/train.csv", header = T)
dim(data)
[1] 42000   785
train = data.matrix(data[1:5000, ])
test = data.matrix(data[5001:10000, ])

train.x = train[,-1]
train.y = train[,1]
test.x = test[,-1]
test.y = test[,1]

# normalize each value to 0-1
train.x = t(train.x/255)
test.x = t(test.x/255)


# input 
data = mx.symbol.Variable('data')
#first conv
conv1 = mx.symbol.Convolution(data = data, kernel = c(5,5), num_filter = 20)
tanh1 = mx.symbol.Activation(data = conv1, act_type = "tanh")
pool1 = mx.symbol.Pooling(data = tanh1, pool_type = "max", kernel = c(2,2), stride = c(2,2))
#second conv
conv2 = mx.symbol.Convolution(data = pool1, kernel = c(5,5), num_filter = 50)
tanh2 = mx.symbol.Activation(data = conv2, act_type = "tanh")
pool2 = mx.symbol.Pooling(data = tanh2, pool_type = "max", kernel = c(2,2), stride = c(2,2))
#first fullc
flatten = mx.symbol.flatten(data = pool2)
fc1 = mx.symbol.FullyConnected(data = flatten, num_hidden = 500)
tanh3 = mx.symbol.Activation(data = fc1, act_type = "tanh")
#second fullc
fc2 = mx.symbol.FullyConnected(data = tanh3, num_hidden = 10)
#loss
lenet = mx.symbol.SoftmaxOutput(data = fc2)

#reshape the matrices into arrays
train.array = train.x
dim(train.array) = c(28,28,1,ncol(train.x))

test.array = test.x
dim(test.array) = c(28,28,1,ncol(test.x))

#train the CNN

mx.set.seed(0)
tic = proc.time()
model = mx.model.FeedForward.create(lenet, X = train.array, y =train.y,
                                    num.round = 10, array.batch.size = 100, 
                                    ctx = mx.cpu(), learning.rate = 0.05, momentum = 0.9,
                                    wd = 1e-5, eval.metric = mx.metric.accuracy,
                                    epoch.end.callback = mx.callback.log.train.metric(100))
Start training with 1 devices
[1] Train-accuracy=0.128600000776351
[2] Train-accuracy=0.795000001192093
[3] Train-accuracy=0.909600002765656
[4] Train-accuracy=0.947199997901917
[5] Train-accuracy=0.968800003528595
[6] Train-accuracy=0.977800009250641
[7] Train-accuracy=0.986400011777878
[8] Train-accuracy=0.989200010299683
[9] Train-accuracy=0.991400008201599
[10] Train-accuracy=0.993800005912781
preds = predict(model, test.array)
dim(preds)
[1]   10 5000
pred = apply(preds, 2, which.max)
pred = pred - 1
caret::confusionMatrix(as.factor(pred), as.factor(test.y))
Confusion Matrix and Statistics

          Reference
Prediction   0   1   2   3   4   5   6   7   8   9
         0 487   0   1   0   0   2   3   2   1   2
         1   0 530   5   2   1   2   2   4   6   0
         2   0   2 472   3   0   0   0  11   3   0
         3   0   1   7 513   0   5   0   1   3   1
         4   0   1   1   0 467   1   0   1   0   2
         5   0   0   0   6   0 418   2   1   3   0
         6   6   1   2   0   4   3 477   0   0   1
         7   1   0   2   0   0   0   0 501   1   5
         8   3   1   8   3   4   4   3   1 449   3
         9   0   1   2   2  14   2   0  11   7 503

Overall Statistics
                                          
               Accuracy : 0.9634          
                 95% CI : (0.9578, 0.9684)
    No Information Rate : 0.1074          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9593          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
Sensitivity            0.9799   0.9870   0.9440   0.9698   0.9531   0.9565
Specificity            0.9976   0.9951   0.9958   0.9960   0.9987   0.9974
Pos Pred Value         0.9779   0.9601   0.9613   0.9661   0.9873   0.9721
Neg Pred Value         0.9978   0.9984   0.9938   0.9964   0.9949   0.9958
Prevalence             0.0994   0.1074   0.1000   0.1058   0.0980   0.0874
Detection Rate         0.0974   0.1060   0.0944   0.1026   0.0934   0.0836
Detection Prevalence   0.0996   0.1104   0.0982   0.1062   0.0946   0.0860
Balanced Accuracy      0.9887   0.9910   0.9699   0.9829   0.9759   0.9769
                     Class: 6 Class: 7 Class: 8 Class: 9
Sensitivity            0.9795   0.9400   0.9493   0.9729
Specificity            0.9962   0.9980   0.9934   0.9913
Pos Pred Value         0.9656   0.9824   0.9374   0.9280
Neg Pred Value         0.9978   0.9929   0.9947   0.9969
Prevalence             0.0974   0.1066   0.0946   0.1034
Detection Rate         0.0954   0.1002   0.0898   0.1006
Detection Prevalence   0.0988   0.1020   0.0958   0.1084
Balanced Accuracy      0.9878   0.9690   0.9713   0.9821

Outside R platforms

Summary

  • Deep learning can be viewed as blending many logistic regressions
  • Can be used to do different things
    • Encode
    • Compete
    • Predict
  • Artificial intelligence still requires a lot of human intervention